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Abstract.  The  current  revolution  in  computer  hardware  and  architecture  is  having  a  major  impact 
on  many  disciplines.  In  this  paper  we  examine  some  of  the  implications  of  this  revolution  for  some 
problems  in  computational  underwater  acoustics.  In  particular  we  consider  the  application  of  some 
model  multi-array  processor  systems  to  the  numerical  solution  of  the  three-dimensional  parabolic 
approximation,  [2,  3]  for  a  discussion  of  parabolic  approximations  in  underwater  acoustics. 

Our  goal  is  to  illuminate  some  of  the  issues  involved  in  taking  advantage  of  the  current  advances 
in  hardware  technology  to  produce  an  extraordinarily  fast,  inexpensive,  and  portable  computer 
system  for  simulating  underwater  acoustic  wave  propagation. 
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1.  Introduction 

The  current  revolution  in  computer  hardware  and  architecture  is  having  a  major  impact  on 
many  disciplines.  In  this  paper  we  examine  some  of  the  implications  of  this  revolution  for  some 
problems  in  computational  underwater  acoustics.  In  particular  we  consider  the  application  of  some 
model  multi-array  processor  systems  to  the  numerical  solution  of  the  three-dimensional  parabolic 
approximation,  [2,  3]  for  a  discussion  of  parabolic  approximations  in  underwater  acoustics. 

Our  goal  is  to  illuminate  some  of  the  issues  involved  in  taking  advantage  of  the  current  advances 
in  hardware  technology  to  produce  an  extraordinarily  fast,  inexpensive,  and  portable  computer 
system  for  simulating  underwater  acoustic  wave  propagation.  We  refer  to  the  work  of  Fred  Tappert 
[4]  for  a  discussion  of  some  of  the  issues  involved  in  using  a  single  array  processor  system,  which 
he  calls  PESOGEN  (Parablic  Equation  Solution  GENerator),  for  these  problems. 
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fzm  A  typical  multi-array  processor  system  would  look  something  like  above  sketch  where  the  bus 

has  a  total  bandwidth  of  B  bytes/second,  AP  represents  an  array  processor  which  can  communicate 
with  the  bus  at  b  bytes/second  and  compute  at  s  multiply/add  pairs  per  second.  In  Table  1  we 
present  a  list  of  contemporary  array  processors,  their  dates  of  introduction,  their  megaflop  ratings, 
r  and  their  approximate  costs. 


(2)  APTEC  D PS-2400  (a  fast  UNIBUS): 


(3)  YALE  SHARED-BULK-MEMORY  (ELI  CIRCUS): 


This  yields  an  effective  bandwidth  of  20  megabvtes/sec  between  every  pair  of  adjacent  array  pro¬ 
cessors  in  parallel.  This  system  has  been  implemented  at  Yale  with  a  single  FPS-164  and  a  single 
(bulk)  memory  containing  32  megabytes  and  expandable  to  796  megabytes.  It  is  being  implemented 
by  Enrico  dementi  at  IBM  with  an  IBM  mainframe  host  and  at  least  10  FPS-164s. 

It  is  currently  very  fashionable  to  talk  about  having  special  purpose  devices  integrated  into 
computer  systems.  In  particular,  we  know  how  to  build  very  fast  FFT  devices  which  could  be 
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integrated  into  a  typical  array  processor  as  above  (ALU  stands  for  arithmetic-logic  unit).  Suppose 
we  wish  to  do  complex  FFTs  based  on  «  =  1024  points.  We  assume  that,  the  data  resides  in 
memory,  must  be  moved  to  the  FFT  box  and  then  moved  back  to  memory  after  being  transformed. 
Furthermore,  there  is  little  point  to  building  the  FFT  box  unless  this  process  is  compute  bound 
(as  distinct  from  I/O  or  data  movement  bound),  i.e.. 


16n  4nloen 


data  movement  computation 

(4  byte  complex  arithmetic) 


or 


5  1 

—  <  -  log  n  =  2.5  or 
B  ~  4  6 


s  <  2.5  x  B. 


This  inequality  tells  us  that  for  this  situation  there  is  little  point  in  making  the  FFT  box  arbitrarily 
fast.  In  fact,  an  upper  bound  for  its  speed  is  provided  in  terms  of  the  bus  bandwidth. 


Our  general  approach  is  based  on  what  we  call  the  ELI  CIRCUS  ALGORITHM  DESIGN 
DISCIPLINE  ,  i.e.,  we  think  of  a  model  architecture  that  generalizes  the  one  based  on  the 


B  megabytes/sec 


Vale  Shared-Bulk-Memory  as  shown  in  the  sketch  above.  What  is  important  for  the  inventor  and 
implementor  of  parallel  algorithms  is  the  overall  simple  structure  of  this  target  architecture.  This 
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structure  is  sufficiently  simple  to  allow  the  formulation  and  analysis  of  parallel  algorithms.  The 
exact  details  of  the  hardware  implementation  are  completely  unimportant  at  this  level.  Indeed  we 
envision  having  an  automatic  system  for  mapping  algorithms  for  the  ELI  CIRCUS  onto  specific 
implementations. 


3.  A  Computation  Intensive  Ocean  Acoustics  Problem 

We  consider  the  problem  of  modelling  the  propagation  of  a  single  frequency  underwater  acoustic 
signal  by  means  of  a  3-dimensional  parabolic  approximation  [2]  such  an  approximation  corresponds 
to  the  equation 


(i) 


u,  =  iM"1  -  1)K  +  ^  J3—. 


[2,  3].  Algorithms  for  solving  Eq.  (1)  are  known  to  be  very  computation  intensive  so  that  the 
application  of  a  multi-array  processor  is  quite  appropriate. 

We  consider  two  basic  algorithms  for  computing  the  solution  to  Eq.  (l)  : 


1.  The  split  step  Fourier  method  due  to  Fred  Tappert,  [4],  which  involves  two  2-dimensional  FFTs 
per  range  step;  and 

2.  The  explicit  finite  difference  approximations  due  to  Tony  Chan,  Ding  Lee,  and  Long-jun  Shen, 
[l],  which  involves  a  matrix-vector  product  with  a  5-diagonal  matrix  per  range  step. 

We  now  describe  implementations  and  analyze  the  performance  of  these  two  algorithms  on  an 
ELI  CIRCUS  architecture. 


4.  The  Split  Step  Fourier  Method. 

Since  most  of  the  work  for  the  split  step  algorithm  is  in  the  2-dimensional  FFTs,  we  limit  our 
discussion  to  that.  In  particular,  we  consider  a  2-dimensional  FFT  on  a  2"  x  2"  mesh  using  a  k 
processor  ELI  CIRCUS  machine.  The  general  idea  is  to  use  an  algorithm  of  the  following  form. 

2D-FFT  ALGORITHM  FOR  CIRCUS  ( k )  : 

1.  Partition  the  mesh  into  k  equal  vertical  pieces  (assuming  k  divides  2”)  ; 

2.  Map  each  piece  into  an  AP; 

3.  Do  the  1-dimensional  FFTs  in  each  AP; 

4.  Map  the  transformed  data  back  into  shared  memory; 

5.  Partition  the  mesh  into  k  equal  horizontal  pieces  (transpose  done  in  hardware); 

G.  Map  each  piece  into  an  AP; 

7.  Do  the  1-dimensional  FFTs  in  each  AP; 

8.  Map  the  transformed  data  back  into  shared  memory. 

To  apply  this  type  of  approach  to  the  split  step  Fourier  algorithm  we  take  A"  =  2"  points  in 
both  the  :  (depth)  and  0  (angle)  variables  and  assume  the  number  of  processors,  k.  divides  N.  In 
each  of  steps  3  and  7  of  the  above  algorithm,  each  processor  does 
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l-dimensional  FFTs  on  A’  points  which  requires 


(2) 


N  41V  log  Ar 
k  s 


seconds. 


We  can  accomplish  steps  4-6  by  a  variety  of  means.  For  example,  by  having  each  processor  use  the 
broadcast  bus  to  broadcast  its  transformed  data  to  all  the  others  which  would  take 
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where  T  is  the  data  transfer  startup  time  or  latency.  Summing  two  times  the  bound  in  (2)  and  (3) 
we  get  a  total  time  of 
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kr+!*  +  V,*y  seconds, 
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which  is  graphed  in  Figure  1  as  a  function  of  k.  If  B  >  b ,  we  can  use  the  shared  memory  to  more 
effectively  move  the  data.  In  fact,  steps  4-6  would  require 
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(5) 

This  yields  a  total  time  of 

(6) 

which  is  less  than  (4)  if  26  <  B. 

Still  another  way  of  accomplishing  the  data  movement  in  steps  4-6  is  to  use  the  direct  nearest 
neighbor  connections  as  follows.  We  base  this  discussion  on  the  model  with  the  local  shared  memory 
previously  described,  i.e.,  the  Yale  Shared-Bulk-Memory. 

PIPELINE  DATA  MOVEMENT  ALGORITHM  (&)  : 

For  j  ss  k  -  1  to  1  step  -  1  : 

1.  Each  AP  writes  ^  data  to  its  right  shared  memory 


2.  Each  AP  reads  the  data  irom  its  left  shared  memory 
Using  this  approach  the  total  time  is  given  by 


1  A2  SA'2  log  A" 
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2  c  ks 


seconds. 


yielding  a  graph  which  is  basically  similar  to  that  displayed  in  Figure  1. 

It  is  clear  from  Figure  1  that  as  we  add  more  nodes  the  running  time  will  eventually  increase 
and  may  eventually  exceed  the  running  time  with  one  node. 

Let's  examine  some  typical  cases  for  N  =  1024. 

1.  VAX  and  FPS-164s  :  T  =  3  x  10"3,  6  =  0.3  x  106,  s  =  10  x  106 


TOTAL  TIME  =  3000/c  + 


S  x  106  8  x  106  x  10 

0.3  x  106  10  x  106  x  k 

seconds. 


seconds 


Thus  for  k  =  2;  TOTAL  TIME  %  28  seconds 

k  =  3;  TOTAL  TIME  fa  26.3  seconds 

k  =  oc;  TOTAL  TIME  «  24  seconds 

However,  we  remark  that  for  a  single  processor  system  the  TOTAL  TIME  =  8  seconds  because 
there  is  no  data  movement. 

2.  APTEC  BUS  and  FPS-164s  :  T  =  3  x  10~3.  b  =  3  x  106.  s  =  10  x  10c 

g 

TOTAL  TIME  ss  2.4  +  —  seconds 

k 

Thus  for  k  =  2;  TOTAL  TIME  fa  6.4  seconds 

k  =  4;  TOTAL  TIME  4.4  seconds 

k  =  oc;  TOTAL  TIME  2.4  seconds 

There  is  an  interesting  question  about  the  use  of  the  shared  memory  on  the  APTEC  BUS.  Suppose 
b  can  have  any  value  between  3  and  12  megabvtes/sec.  The  question  is  how  hard  we  should  try 
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to  make  l  large.  If  k  b  >  24.  e.g..  if  k  >  S  then  using  the  shared  memory  approach  we  have  from 
equation  (C)  that 

16  8 

TOTAL  TIME  (MEMORY)  *  —  +  -  seconds. 

K 

while  using  only  the  broadcast  bus  we  have  from  equation  (4)  that 

TOTAL  TIME  (BUS)  ss  |  |  seconds. 

0  rv 

Thus  if  k  b  >  24.  we  should  use  the  shared  memory  approach  and  there  is  no  advantage  in  making 
b  >  3. 

5.  An  Explicit  Finite  Difference  Scheme 

\Ye  now  switch  to  discussing  an  explicit  5-point  finite  difference  approximations  to  the  para¬ 
bolic  equation.  If  we  block  the  unknowns  by  lines  then  we  can  step  the  solution  out  in  range  by 
using  the  explicit  scheme 

un+1  =  Aun  +  Du"-1  +  gn 

where  A  and  D  are  A’-block  tridiagonal  (with  an  additional  two  off  diagonal  blocks  because  of  the 
periodicity  in  the  6  variable).  If  we  partition  the  data  into  k  equal  vertical  slices  each  of  which  is 
assigned  to  a  processor  then  for  each  range  step  each  processor  must  send  its  left-most  vertical  line 
of  data  to  its  left-hand  neighbor  and  its  right-most  vertical  line  of  data  to  its  right-hand  neighbor. 
If  we  number  the  processors,  we  can  envision  this  algorithm  for  data  movement  as  follows.  In 
sequence,  each  odd  numbered  processor  first  broadcasts  its  left-most  vertical  line  of  data  and  then 
its  right-most  vertical  line  of  data.  Then  the  even  numbered  processors  do  the  same.  This  requires 
data  movement  time  equal  to 


S.V  k  k  \  1G  XI: 

~J~  '  o  +  9  ‘  ^  )  ~  ^ T  H - jj —  seconds. 


However,  the  compute  time  (for  the  A'-block  tridiagonal  part)  is  given  by 


yielding  a  total  time  of 


seconds, 


,  ^  IGA' A-  20A'2 
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seconds. 


The  graph  of  this  expression  (10)  as  a  function  of  k  is  similar  to  the  graph  of  Figure  1.  However, 
the  asymptotic  behavior  of  (10)  as  k  increases  is  significantly  worse  than  that  of  (3)  because  in  (10) 
k  appears  in  the  numerator  of  the  data  transfer  term  as  well  as  the  startups  term  while  in  (3)  it 
appears  in  the  numerator  of  only  the  startups  term.  Moreover,  the  data  transfer  term  is  likely  to 
be  more  important  than  the  startups  term. 

However,  now  the  story  is  quite  different  from  the  split  step  algorithm.  When  we  do  the  data 
movement  of  the  previous  algorithm  using  the  nearest  neighbor  connection  links  instead  of  the 
broadcast  bus,  we  can  use  all  the  links  in  parallel  and  hence  the  time  to  move  the  data  is  given  by 
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seconds, 
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and  the  runing  time  is  given  by 


^  l6Ar  20N2 

2  T  + - +  — r- 

c  sk 


seconds. 
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ime 


Figure  2 
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which  is  pictured  in  Figure  2  and  is  monotonically  decreasing  in  k.  As  before  we  give  some  explicit 
examples  with  N  =  1024. 

1.  VAX  with  FPS-164s  :  T  =  3  x  10-3,  b  =  0.3  x  106,  s  =  10  x  106  (using  the  bus). 


TOTAL  TIME  = 


1600 Ok 
0.3  x  106 


3000 k 

+  + 


20  x  10® 

10  x  106  x  k 


seconds. 


ss<  24 k  +  r 
k 


seconds. 


Hence,  kopt  i*  6  and  (TOTAL  TIME)op,  s*  0.6  seconds. 

2.  APTEC  BUS  and  FPS-164s  :  T  -  3  x  10-3,  t  =  3x  106,  s  =  10  x  106  (using  the  bus). 

Now  kopt  fst  30  and  (TOTAL  TIME)op,  «  0.15  seconds. 

3.  ELI  CIRCUS  using  nearest  neighbor  direct  links  :  c  =  20  x  106 


TOTAL  TIME  « 


16  x  1000  ,  20  x  106 

20  +  10  x  106  x  k 


seconds 


k.  —  seconds. 

hr 

Hence,  k  ~  100  implies  a  TOTAL  TIME  0.02  seconds. 


6.  Summary 

The  solution  of  the  3-dimensional  parabolic  approximation  is  computation  intensive.  The 
algorithm  of  choice  on  a  multi-processor  system  is  very  dependent  on  the  system  architecture. 
For  bus-based  architectures  the  split  step  Fourier  method  is  likely  to  be  superior  to  explicit  finite 
difference  methods  while  on  a  ring-based  machine  with  independent  nearest  neighbor  interconnects, 
the  opposite  is  probably  true. 

Furthermore,  for  reasonable  system  parameters  we  are  likely  to  see  almost  linear  speedups  for 
small  to  moderate  number  of  processors.  However,  for  bus-based  architectures  the  running  time 
will  eventually  start  to  increase  as  the  number  of  processors  is  increased. 
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